| Data type | N of objects | N of timestamps |
|---|---|---|
| Cross-section | many | one |
| Time-series | one | many |
| Panel data | many | many |

The data give the speed of cars and the distances taken to stop. Note that the data were recorded in the 1920s.
data(cars)
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
The classic airline data. Monthly totals of international airline passengers, 1949 to 1960.
data(AirPassengers)
AirPassengers
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
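A short sketch of how the "one object, many timestamps" structure shows up in R: `AirPassengers` is stored as a `ts` object, and its time index can be inspected with base-R helpers. Nothing here goes beyond base R.

```r
# AirPassengers ships with base R's datasets package
data(AirPassengers)

class(AirPassengers)      # "ts": a regularly spaced time-series object
start(AirPassengers)      # first observation: year 1949, period 1 (January)
end(AirPassengers)        # last observation: year 1960, period 12 (December)
frequency(AirPassengers)  # 12 observations per year, i.e. monthly data
length(AirPassengers)     # 12 years x 12 months = 144 observations
```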
A data frame with 3580 observations on the following 18 variables: id (identifier for panel individual; 716 total), year interviewed (1982, 1983, 1985, 1987, 1988), lwage: ln(wage/GNP deflator), hours (usual hours worked), age (age in current year), educ (current grade completed), etc.
df <- read.csv("http://www.principlesofeconometrics.com/poe5/data/csv/nls_panel.csv")
head(df[,1:6], 10)
## id year lwage hours age educ
## 1 1 82 1.808289 38 30 12
## 2 1 83 1.863417 38 31 12
## 3 1 85 1.789367 38 33 12
## 4 1 87 1.846530 40 35 12
## 5 1 88 1.856449 40 37 12
## 6 2 82 1.280933 48 36 17
## 7 2 83 1.515855 43 37 17
## 8 2 85 1.930170 35 39 17
## 9 2 87 1.919034 42 41 17
## 10 2 88 2.200974 42 43 17
Source: https://rdrr.io/github/ccolonescu/PoEdata/man/nls_panel.html
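To see the "many objects, many timestamps" structure without downloading the file, the sketch below retypes the ten printed rows by hand (only `id`, `year` and `lwage`) and counts individuals, periods and observations per individual. The object names `panel`, `n_ids`, `n_years` and `is_balanced` are illustrative, not part of the original material.

```r
# First ten rows of the panel, typed in by hand from the printed output
panel <- data.frame(
  id    = rep(c(1, 2), each = 5),
  year  = rep(c(82, 83, 85, 87, 88), times = 2),
  lwage = c(1.808289, 1.863417, 1.789367, 1.846530, 1.856449,
            1.280933, 1.515855, 1.930170, 1.919034, 2.200974)
)

# Number of distinct individuals and distinct time periods
n_ids   <- length(unique(panel$id))
n_years <- length(unique(panel$year))

# A panel is balanced when every individual is observed in every period
obs_per_id  <- table(panel$id)
is_balanced <- all(obs_per_id == n_years)

n_ids
n_years
is_balanced
```

For the full data set, 716 individuals times 5 interview years gives 3580 rows, which matches the number of observations quoted above and is consistent with a balanced panel.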
mtcars
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). See ?mtcars for details.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
A data frame with 32 observations on 11 (numeric) variables.
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Continuous variable (mpg): find the mean.
# base-R
mean(mtcars$mpg)
## [1] 20.09062
# dplyr (as a part of tidyverse) #1
library(tidyverse)
mtcars %>% select(mpg) %>% summarise(mean = mean(mpg)) %>% pull()
## [1] 20.09062
# dplyr (as a part of tidyverse) #2
summarise(mtcars, mean(mpg)) %>% pull()
## [1] 20.09062
# dplyr + base-R
mean(mtcars %>% select(mpg) %>% pull())
## [1] 20.09062
mtcars %>% select(mpg) %>% pull() %>% mean()
## [1] 20.09062
Binary variable (vs): engine shape (0 = V-shaped, 1 = straight). Find the mean.
mean(mtcars$vs)
## [1] 0.4375
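For a variable coded 0/1, the mean is simply the share of ones. A minimal check in base R (the names `n_straight`, `n_total` and `share` are illustrative):

```r
data(mtcars)

# Count the cars with a straight engine (vs == 1) and all cars
n_straight <- sum(mtcars$vs == 1)
n_total    <- nrow(mtcars)

# The share of straight engines equals the mean of the 0/1 variable
share <- n_straight / n_total
share                               # 14 / 32 = 0.4375
all.equal(share, mean(mtcars$vs))   # TRUE
```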
Mean mpg by vs type
# base-R (data.frame object)
aggregate(mpg ~ vs, data=mtcars, mean)
## vs mpg
## 1 0 16.61667
## 2 1 24.55714
# tidyverse (tibble object)
mtcars %>% group_by(vs) %>% summarise(mean(mpg))
## # A tibble: 2 x 2
## vs `mean(mpg)`
## <dbl> <dbl>
## 1 0 16.6
## 2 1 24.6
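A third option, sketched here with base R only, is `tapply()`, which returns a named vector, so the group means can be reused directly (for instance to take their difference). The name `means_by_vs` is illustrative.

```r
data(mtcars)

# Apply mean() to mpg within each level of vs; names are "0" and "1"
means_by_vs <- tapply(mtcars$mpg, mtcars$vs, mean)
means_by_vs

# Difference in mean mpg: straight engines minus V-shaped engines
means_by_vs[["1"]] - means_by_vs[["0"]]
```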
Let's create a grouping variable for the car manufacturer (the first word of each row name).
# row.names contain:
mtcars %>% row.names() %>% head(3)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
# the same with base-R
head(row.names(mtcars), 3)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
# split by space (base-R list object)
mtcars %>% row.names() %>% strsplit(split = ' ') %>% head(3)
## [[1]]
## [1] "Mazda" "RX4"
##
## [[2]]
## [1] "Mazda" "RX4" "Wag"
##
## [[3]]
## [1] "Datsun" "710"
# extract the first element after split and assign to variable
mtcars %>% row.names() %>% strsplit(split = ' ') %>% map(1) %>% unlist() -> manufacturer
# the same assignment
manufacturer <- mtcars %>% row.names() %>% strsplit(split = ' ') %>% map(1) %>% unlist()
# and display
manufacturer %>% head(3)
## [1] "Mazda" "Mazda" "Datsun"
# create the column
mtcars %>% mutate(manufacturer) %>% head()
## mpg cyl disp hp drat wt qsec vs am gear carb manufacturer
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant
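Note that the row names are no longer shown in the output above. A base-R sketch that keeps them works on a copy of the data and strips everything after the first space with `sub()`; the names `cars2` and the regular expression are illustrative choices. The first word is only a rough proxy for the manufacturer, since some row names start with a model name.

```r
data(mtcars)

# Work on a copy so the built-in data set stays untouched
cars2 <- mtcars

# Drop the first space and everything after it; names without a space are kept as-is
cars2$manufacturer <- sub(" .*$", "", rownames(cars2))

# Row names are preserved alongside the new column
head(cars2[, c("mpg", "manufacturer")], 3)
```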
Find the mean mpg by manufacturer.
mtcars %>% mutate(manufacturer) %>%
group_by(manufacturer) %>% summarise(mean1 = mean(mpg)) %>%
arrange(desc(mean1)) %>% head()
## # A tibble: 6 x 2
## manufacturer mean1
## <chr> <dbl>
## 1 Honda 30.4
## 2 Lotus 30.4
## 3 Fiat 29.8
## 4 Toyota 27.7
## 5 Porsche 26
## 6 Datsun 22.8
# group_by two factors
mtcars %>% mutate(manufacturer) %>%
group_by(manufacturer, am) %>% summarise(mean1 = mean(mpg)) %>%
arrange(desc(manufacturer)) %>% head()
## # A tibble: 6 x 3
## # Groups: manufacturer [5]
## manufacturer am mean1
## <chr> <dbl> <dbl>
## 1 Volvo 1 21.4
## 2 Valiant 0 18.1
## 3 Toyota 0 21.5
## 4 Toyota 1 33.9
## 5 Porsche 1 26
## 6 Pontiac 0 19.2
Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity
Source: https://www.econometrics-with-r.org/3.7-scatterplots-sample-covariance-and-sample-correlation.html
# Correlation between mpg (Miles/gallon) and wt (Weight)
cor(mtcars$mpg, mtcars$wt)
## [1] -0.8676594
plot(mtcars$wt ~ mtcars$mpg,
xlab='Miles/(US) gallon',
ylab='Weight (1000 lbs)')
abline(lm(mtcars$wt ~ mtcars$mpg), col='blue')
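The number returned by `cor()` can be reproduced from its definition: the sample covariance divided by the product of the two sample standard deviations. A minimal sketch (the name `r_manual` is illustrative):

```r
data(mtcars)
x <- mtcars$wt
y <- mtcars$mpg

# Pearson correlation = covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual

# Agrees with the built-in function
all.equal(r_manual, cor(x, y))
```

Because the covariance is scaled by both standard deviations, the result is unit-free and always lies between -1 and 1.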
Test for significance (the null hypothesis is that there is no linear relationship, i.e. the true correlation is 0)
cor.test(mtcars$mpg, mtcars$wt)
##
## Pearson's product-moment correlation
##
## data: mtcars$mpg and mtcars$wt
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
# extract p-value and compare
cor.test(mtcars$mpg, mtcars$wt)$p.value < 0.05
## [1] TRUE
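The t statistic and p-value reported by `cor.test()` can be rebuilt by hand. Under the null hypothesis of zero correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) follows a t distribution with \(n-2\) degrees of freedom. The sketch below uses illustrative names (`t_stat`, `p_value`).

```r
data(mtcars)
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)

# Test statistic for H0: true correlation is 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

t_stat
p_value
```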
Test for significance (the null hypothesis is that there is no linear relationship), now for a pair of weakly correlated variables
# Rear axle ratio (drat) vs. 1/4 mile time (qsec)
cor.test(mtcars$drat, mtcars$qsec)
##
## Pearson's product-moment correlation
##
## data: mtcars$drat and mtcars$qsec
## t = 0.50164, df = 30, p-value = 0.6196
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.265947 0.426340
## sample estimates:
## cor
## 0.09120476
# extract p-value and compare
cor.test(mtcars$drat, mtcars$qsec)$p.value < 0.05
## [1] FALSE
plot(mtcars$qsec ~ mtcars$drat)
abline(lm(mtcars$qsec ~ mtcars$drat), col='blue')
Correlation provides a measure of the linear association between pairs of variables, but it doesn’t tell us about more complex relationships.
You can use regression to develop a more formal understanding of relationships between variables. In regression, and in statistical modeling in general, we model the relationship between an output variable (the response, or dependent variable) and one or more input variables (factors, or independent variables).
Source: https://www.jmp.com/en_ch/statistics-knowledge-portal/what-is-regression.html
Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity
The actual mechanics: choose the line that minimizes the sum of squared vertical distances (residuals) between the observed points and the line.
Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html#correlation-covariance-and-linearity
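With a single regressor, the slope and intercept that minimize the sum of squared residuals have a closed form: slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x). A sketch on mtcars, with wt as the dependent variable and mpg as the regressor (the names `b0`, `b1` and `fit` are illustrative):

```r
data(mtcars)
x <- mtcars$mpg   # regressor
y <- mtcars$wt    # dependent variable

# Closed-form least-squares estimates for one regressor
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# The same estimates from lm()
fit <- lm(wt ~ mpg, data = mtcars)

c(intercept = b0, slope = b1)
coef(fit)
```

The intercept formula also shows that the fitted line always passes through the point of means (mean(x), mean(y)).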
Example mtcars
Estimate the regression coefficients and test, for each coefficient, the null hypothesis \(H_0: b_j = 0\) (no relationship)
# wt (Weight) is the dependent variable
# mpg (Miles/Gallon) is the independent variable
lm(wt ~ mpg, data=mtcars) %>% summary()
##
## Call:
## lm(formula = wt ~ mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6516 -0.3490 -0.1381 0.3190 1.3684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.04726 0.30869 19.590 < 2e-16 ***
## mpg -0.14086 0.01474 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4945 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
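Rather than reading the decision off the printed summary, the coefficient table can be extracted as a matrix and compared with a significance level. A minimal sketch, assuming a 5% level; the names `coef_table`, `p_values` and `reject_h0` are illustrative.

```r
data(mtcars)
fit <- lm(wt ~ mpg, data = mtcars)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- coef(summary(fit))
coef_table

# Reject H0: b_j = 0 at the 5% level when the p-value is below 0.05
p_values  <- coef_table[, "Pr(>|t|)"]
reject_h0 <- p_values < 0.05
reject_h0

# 95% confidence intervals for the coefficients
confint(fit, level = 0.95)
```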
Common representation in articles
library(stargazer)
model1 <- lm(wt ~ mpg, data=mtcars)
stargazer(model1, type = 'text')
##
## ===============================================
## Dependent variable:
## ---------------------------
## wt
## -----------------------------------------------
## mpg -0.141***
## (0.015)
##
## Constant 6.047***
## (0.309)
##
## -----------------------------------------------
## Observations 32
## R2 0.753
## Adjusted R2 0.745
## Residual Std. Error 0.494 (df = 30)
## F Statistic 91.375*** (df = 1; 30)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
More info: https://bookdown.org/yihui/rmarkdown-cookbook/kable.html
Estimations from the previous example
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.845 -10.240 -0.308 9.815 43.461
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03225 7.41131 92.566 < 2e-16 ***
## x1 -1.10130 0.38028 -2.896 0.00398 **
## x2 -0.64978 0.03934 -16.516 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared: 0.4264, Adjusted R-squared: 0.4237
## F-statistic: 155 on 2 and 417 DF, p-value: < 2.2e-16
model2 <- lm(wt ~ mpg + hp + cyl, data=mtcars)
stargazer(model1, model2, type = 'text')
##
## =================================================================
## Dependent variable:
## ---------------------------------------------
## wt
## (1) (2)
## -----------------------------------------------------------------
## mpg -0.141*** -0.125***
## (0.015) (0.029)
##
## hp -0.002
## (0.002)
##
## cyl 0.135
## (0.112)
##
## Constant 6.047*** 5.186***
## (0.309) (1.130)
##
## -----------------------------------------------------------------
## Observations 32 32
## R2 0.753 0.766
## Adjusted R2 0.745 0.740
## Residual Std. Error 0.494 (df = 30) 0.498 (df = 28)
## F Statistic 91.375*** (df = 1; 30) 30.481*** (df = 3; 28)
## =================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Note: The measure of fit \(R^2\) (the coefficient of determination) is the fraction of the sample variance of the dependent variable that is explained by the factors.
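Both fit measures in the table can be recomputed from the residuals. The sketch below does this for the three-regressor model, using \(R^2 = 1 - SSR/SST\) and the usual adjustment \(1 - (1 - R^2)(n - 1)/(n - k - 1)\), where \(k\) is the number of regressors. The names `ssr`, `sst`, `r2` and `adj_r2` are illustrative.

```r
data(mtcars)
fit <- lm(wt ~ mpg + hp + cyl, data = mtcars)
y   <- mtcars$wt

ssr <- sum(residuals(fit)^2)     # sum of squared residuals
sst <- sum((y - mean(y))^2)      # total sum of squares

# Share of the variation in y explained by the regressors
r2 <- 1 - ssr / sst

# Adjusted R2 penalizes additional regressors
n <- nrow(mtcars)
k <- length(coef(fit)) - 1       # regressors, excluding the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

round(c(R2 = r2, adj_R2 = adj_r2), 3)
```

Adding hp and cyl raises \(R^2\) from 0.753 to 0.766 but lowers the adjusted \(R^2\) from 0.745 to 0.740, so the extra regressors add little explanatory power here.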
Pay attention to the following:
Source: https://scpoecon.github.io/ScPoEconometrics/linreg.html